Dataset Summary

This dataset consists of 1.5M beer reviews from the website Beeradvocate.com. It includes various pieces of information about each beer and the reviewers’ impressions of its taste. I chose to focus my analysis on the ABV, beer name, beer style and overall impression categories.

Source - https://data.world/socialmediadata/beeradvocate

At a glance

Before performing any analysis I cleaned the dataset. I noticed a large number of beers with only 1 review, and I treated these as erroneous entries caused by a misspelling of the beer’s name or some other mistake by the user. Before removing them, the dataset held ~57K unique beers with an average of 28 reviews per beer. After removing all beers with a single review, the number of distinct beers sits at ~38K with an average of 41 reviews per beer. The average ABV across the dataset is 7.1% and the max is 43%, that’s one seriously strong beer!
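The cleaning step described above could be sketched in base R roughly as follows. Only the `beer` data frame name appears in this report; the `beer_name` and `review_overall` column names and the toy rows are assumptions made for illustration.

```r
# Toy stand-in for the review table (hypothetical rows and column names).
beer <- data.frame(
  beer_name = c("A", "A", "B", "C", "C", "C", "D"),
  review_overall = c(4, 4.5, 3, 3.5, 4, 4, 5)
)

# Count the reviews recorded for each beer name.
reviews.per.beer <- table(beer$beer_name)

# Keep only the rows whose beer has more than one review.
keep <- names(reviews.per.beer)[reviews.per.beer > 1]
beer.clean <- beer[beer$beer_name %in% keep, ]

# Distinct beers and average reviews per beer, before and after cleaning.
distinct.beers <- c(before = length(reviews.per.beer), after = length(keep))
avg.reviews <- c(before = nrow(beer) / length(reviews.per.beer),
                 after = nrow(beer.clean) / length(keep))
distinct.beers
avg.reviews
```

Dropping single-review beers raises the average reviews per beer because the rows removed all sit at the very bottom of the count distribution.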

Top 10 by ABV

Since I just spoke about ABVs, let’s dive a little deeper into that category of the data. The following table shows the top 10 ABVs, ranked by the total number of reviews for beers with that ABV.

Clocking in with almost twice as many reviews as number two are beers with an ABV of 5%. Some of the most notable beers in this category are Budweiser, Stella Artois, Heineken and Miller High Life. These beers are widely available in stores and bars, so it’s no wonder they have so many reviews!

While 5% may be the most reviewed ABV, it has the lowest average rating of the ten, with a score of 3.64. The highest average rating belongs to 9% ABV with a score of 3.96, followed by 7.5% at 3.94, then a three-way tie between 7%, 8% and 10% at 3.93.
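As a rough sketch of how a ranking like this could be produced in base R, the example below counts reviews per ABV and then averages the overall rating for the most reviewed values. The `beer_abv` and `review_overall` column names and the toy rows are assumptions, and it keeps the top 3 rather than the top 10 to stay small.

```r
# Toy stand-in for the review table (hypothetical rows and column names).
beer <- data.frame(
  beer_abv = c(5, 5, 5, 5, 9, 9, 9, 7.5, 7.5, 6),
  review_overall = c(3.5, 4, 3, 4, 4, 4.5, 4, 4, 3.5, 5)
)

# Number of reviews at each ABV, largest first.
abv.counts <- sort(table(beer$beer_abv), decreasing = TRUE)
abv.top <- abv.counts[seq_len(3)]  # the report keeps the top 10

# Average overall rating for each of the most reviewed ABVs.
in.top <- beer$beer_abv %in% as.numeric(names(abv.top))
abv.mean <- tapply(beer$review_overall[in.top], beer$beer_abv[in.top], mean)

# Combine counts and rounded averages, ordered by review count.
abv.summary <- data.frame(
  abv = as.numeric(names(abv.top)),
  reviews = as.vector(abv.top),
  avg.rating = round(as.vector(abv.mean[names(abv.top)]), 2)
)
abv.summary
```

Sorting by review count and then looking up the average keeps the two questions separate: which ABVs are most reviewed, and how well each of them is rated.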

Another interesting tidbit is that these top 10 ABVs make up 38.3% of the total dataset!

paste(round(sum(abv.top10) / nrow(beer) * 100, digits = 1), "%", sep = '')
## [1] "38.3%"

Top 10 beers by name

Let’s shift focus to the top 10 beers by name, ranked by the total number of reviews.

As you can see from the chart below, Sierra Nevada Brewing placed two different beers in the top 10, with their Celebration Ale coming in at number 3 with 3000 reviews and their Pale Ale at number 7 with almost 2600.

All 10 of these beers scored above a 4.0 average in the overall impression category, with the lowest being Arrogant Bastard Ale from Stone Brewing at 4.1 and the highest being Pliny the Elder from Russian River Brewing at 4.6.

Top 10 beers by style

Next up are the top 10 most reviewed beer styles. The front runner, by over 30K reviews, is the American IPA with ~117K reviews, followed by the double IPA at ~86K. Average ratings vary more here than in the previous category. The highest average belongs to the American Double or Imperial Stout at 4.03 and the lowest to the Fruit or Vegetable beer at 3.42, a spread of 0.61 compared to 0.5 for the top 10 beers by name.

ABV Distribution by style

Diving into the ABV distributions for the top 10 styles, we can see some pretty large differences between the styles. Each has a slightly different interquartile range and number of outliers.

The American Pale, IPA and Porter all have a short box, indicating their interquartile range is relatively small. The American Strong Ale and the Imperial Stout, on the other hand, both have a tall box, indicating their interquartile range is larger. These are also the only two styles that do not have any outliers below their lower whisker.

The Imperial Stout and Imperial IPA have the widest ranges, each with more than 30 percentage points between its minimum and maximum ABV.
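The quantities read off the box plots above (box height, full range and outliers) can also be computed directly. The sketch below uses base R on invented ABV values; the `beer_style` and `beer_abv` column names are assumptions, and `boxplot.stats()` applies the same 1.5 × IQR whisker rule that `boxplot()` uses by default.

```r
# Toy ABV values for two styles (hypothetical data and column names).
beer <- data.frame(
  beer_style = rep(c("American IPA", "Imperial Stout"), each = 6),
  beer_abv = c(6.0, 6.5, 6.8, 7.0, 7.2, 7.5,
               8.0, 9.0, 10.0, 11.5, 13.0, 25.0)
)

# Interquartile range (the height of the box) for each style.
abv.iqr <- tapply(beer$beer_abv, beer$beer_style, IQR)

# Full spread (maximum minus minimum) for each style.
abv.spread <- tapply(beer$beer_abv, beer$beer_style,
                     function(x) diff(range(x)))

# Points that fall beyond the whiskers for each style.
abv.outliers <- tapply(beer$beer_abv, beer$beer_style,
                       function(x) boxplot.stats(x)$out)

abv.iqr
abv.spread
abv.outliers
```

Comparing the interquartile range with the full spread shows whether a style is wide across the board or simply has a few extreme beers stretching its range.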

Overall rating distribution

Looking at the ratings users provided in the “overall review” column, we see the data is left skewed. Most ratings fall between 3.5 and 4.5, with a peak at 4 that is nearly twice as high as the bars at 3.5 or 4.5.

Central Limit Theorem

Let’s investigate this data further by applying the central limit theorem. The theorem states that even if a population is not normally distributed, the distribution of the means of independent random samples drawn from it tends toward a normal distribution as the sample size increases. The spread of those sample means also shrinks, in proportion to one over the square root of the sample size.

Below are four histograms, each built from 1000 random samples of the “overall review” data, with sample sizes of 10, 50, 100 and 200. As the sample size increases, the distribution of sample means narrows and more closely resembles a bell-shaped curve.

## population mean: 3.8 sd: 0.72
## Sample size 10, mean: 3.8 sd: 0.23
## Sample size 50, mean: 3.8 sd: 0.1
## Sample size 100, mean: 3.8 sd: 0.07
## Sample size 200, mean: 3.8 sd: 0.05
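The experiment summarized above can be reproduced in outline with base R. This is a minimal sketch on a synthetic left-skewed population, not the code used for this report: the rating weights, the population size and the seed are all invented for illustration.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Synthetic left-skewed ratings on a 1-5 scale in 0.5 steps. The weights
# are made up for illustration and are not taken from the dataset.
ratings <- seq(1, 5, by = 0.5)
weights <- c(1, 1, 2, 3, 6, 14, 30, 16, 6)
population <- sample(ratings, 100000, replace = TRUE, prob = weights)

# Draw 1000 samples of each size and record the mean of every sample.
sizes <- c(10, 50, 100, 200)
sample.means <- lapply(sizes, function(n) {
  replicate(1000, mean(sample(population, n)))
})
names(sample.means) <- sizes

# The spread of the sample means should track sd(population) / sqrt(n).
clt.summary <- data.frame(
  size = sizes,
  observed.sd = round(sapply(sample.means, sd), 3),
  expected.sd = round(sd(population) / sqrt(sizes), 3)
)
clt.summary
```

Passing any element of `sample.means` to `hist()` would draw the corresponding histogram. The same square-root relationship accounts for the standard deviations reported above, for example 0.72 / sqrt(10) ≈ 0.23.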

Sampling of review ratings

Sampling techniques let us extract a portion of a dataset to test theories or perform analysis without having to use the entire dataset, all while maintaining the proportions of the original data. This can prove invaluable when the original dataset is too large to process for one reason or another. Care needs to be taken to ensure that the sampling method used maintains a distribution similar to that of the original dataset, or the analysis results could be severely skewed.

Below I show charts of the “overall review” data using the original data and three different sampling techniques: simple random sampling without replacement, systematic sampling and stratified sampling.

Simple random sampling without replacement selects rows at random from the dataset and does not return a selected row to the pool to be chosen again. This ensures each row can appear in the sample at most once, so no single review is counted twice.

Systematic sampling picks a starting item to include in the sample, then uses a sampling interval that determines how many items to skip before selecting the next one. This continues until the end of the population is reached.

The last method is stratified sampling. First the population is divided into mutually exclusive sub-groups and from within those groups, members are randomly chosen to participate in the sample.
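The three methods can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code behind the charts: the data is synthetic, the `review_overall` column name is assumed, and stratifying by rating value with a proportional share per stratum is just one possible choice of sub-groups.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Synthetic ratings standing in for the real column (hypothetical data).
ratings <- seq(1, 5, by = 0.5)
beer <- data.frame(
  review_overall = sample(ratings, 10000, replace = TRUE,
                          prob = c(1, 1, 2, 3, 6, 14, 30, 16, 6))
)
N <- nrow(beer)  # population size
n <- 500         # target sample size

# Simple random sampling without replacement: each row at most once.
srs.rows <- sample(N, n, replace = FALSE)

# Systematic sampling: random start in the first interval, then every
# k-th row after it.
k <- floor(N / n)
start <- sample(k, 1)
sys.rows <- seq(start, by = k, length.out = n)

# Stratified sampling: split the row numbers by rating value, then draw
# a proportional share at random from within each stratum.
strata <- split(seq_len(N), beer$review_overall)
str.rows <- unlist(lapply(strata, function(rows) {
  take <- round(length(rows) * n / N)
  rows[sample.int(length(rows), take)]
}))

# Compare the rating proportions of each sample with the population.
prop <- function(rows) {
  prop.table(table(factor(beer$review_overall[rows], levels = ratings)))
}
proportions <- round(rbind(
  population = prop(seq_len(N)),
  simple.random = prop(srs.rows),
  systematic = prop(sys.rows),
  stratified = prop(str.rows)
), 3)
proportions
```

Because each stratum size is rounded, the stratified sample can differ from the target size by a few rows. Systematic sampling depends on the row order, so any pattern in how the rows are sorted can carry over into the sample.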

Please note that each chart appears very similar in shape, but the sample size behind each is drastically different. The original dataset has 1.56M rows, but only 250K were used for systematic sampling, 30K for simple random sampling without replacement and 1K for stratified sampling.

While the shapes are very similar, upon closer inspection you can see some differences. Within the systematic sampling chart the 3.5 value is only ~43K but the 4.5 value is ~60K. On all of the other charts these two values are very close to being the same. As I mentioned before, sampling can prove invaluable when a dataset is simply too large to work with, but caution needs to be exercised as it can skew the proportions.

Conclusion

This is a large and diverse dataset with many avenues to explore beyond what I touched upon in this report. I could see this data being used for data mining or predictive analytics to help breweries reach new customers who have liked other similar beers, or perhaps to brew a new beer in a popular style or ABV where they do not currently have a product.